Tokyo Tech at TRECVID 2008

Authors

  • Shanshan Hao
  • Yusuke Yoshizawa
  • Koji Yamasaki
  • Koichi Shinoda
  • Sadaoki Furui
Abstract

The Tokyo Institute of Technology team participated in the high-level feature extraction, surveillance event detection pilot, and rushes summarization tasks of TRECVID 2008. In the high-level feature (HLF) extraction task, we built on the framework we proposed last year, which combines a tree-structured codebook with a node selection technique. This year we focused on the position information of each object-related HLF: during training, we applied our method not to the whole key-frame image but only to the regions of the image that contain the annotated HLF. In the TRECVID 2008 evaluation, the inferred average precisions of our three runs were all 0.011, so the modification made this year did not improve performance. In the surveillance event detection pilot task, we used optical flow features and a support vector machine (SVM) to detect each surveillance event; we present our preliminary experimental results in this paper. In the rushes summarization task, we estimated the number of scenes for a summary using the minimum description length criterion, based on two low-level features: the YCbCr color histogram and optical flow.

1. High-Level Feature Extraction

In the high-level feature (HLF) extraction task of TRECVID 2007, we proposed a method that uses a tree-structured codebook and a node selection technique to extract HLFs from video data. In this method, we first construct a tree-structured codebook shared among all HLFs by clustering SIFT descriptors [3]. We then select nodes to be used as visual words for each HLF. Since motion information is important for HLFs that involve movement, we also employ motion words, which are defined as visual words with motion activity.

A key-frame image in which an HLF is annotated often contains many uninteresting parts, so extracting features from the whole key-frame image, rather than only from the areas of interest, may introduce a lot of noise. To reduce the impact of this noise, in this year's task we improved the framework by taking the location information of each HLF into account. In the annotation data provided by MCG-ICT-CAS [4], the 20 HLFs of this year's task are divided into two groups: object-related HLFs and scene-related HLFs. Object-related HLFs, such as bridge, dog, and two people, can be located by rectangular regions. In contrast, scene-related HLFs, such as classroom, kitchen, and cityscape, cannot be located in exact regions. In our experiment, we focus on the annotated regions of the key-frame images for the object-related HLFs; for the scene-related HLFs, which have no region annotations, we use the whole key-frame image as the location of the HLF.

The remainder of this section is organized as follows. Subsections 1.1-1.3 introduce our system framework and the method we proposed last year. Subsection 1.5 describes the location information used by our method. Subsection 1.6 reports our experimental results, and Subsection 1.7 concludes our work on the HLF extraction task.

1.1 High-level feature extraction system

Figure 1 shows our HLF extraction system. We first select key-frame images from the videos. We then extract Harris-Affine regions [5] from each selected key-frame image and describe each region with a 128-dimensional SIFT descriptor (4x4 grid, 8 orientations). We use the Region Descriptor software [9] provided by the Visual Geometry Group [10] with its default parameters to extract the affine-invariant regions.
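Below is a minimal sketch of this descriptor extraction step. The paper uses the VGG Region Descriptor binary (Harris-Affine regions described by SIFT), which is not a library call; OpenCV's SIFT detector is used here purely as a stand-in, so the detected regions differ from those of the actual system.

```python
# Sketch: extract 128-dimensional local descriptors from a key-frame image.
# Assumption: OpenCV's DoG-based SIFT replaces the Harris-Affine + SIFT pipeline
# that the actual system obtains from the VGG Region Descriptor tool.
import cv2

def extract_descriptors(keyframe_path):
    """Return (keypoints, descriptors); each descriptor is 128-dimensional
    (4x4 spatial grid x 8 orientation bins)."""
    image = cv2.imread(keyframe_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    return keypoints, descriptors   # descriptors has shape (num_regions, 128)
```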
These SIFT descriptors are then quantized with a binary-tree cluster codebook that is constructed in advance from the training data and shared by all HLFs. For each HLF, we select nodes from the tree codebook and use the selected nodes as visual words. In addition, we extract motion words, which are defined as visual words with motion activity. For each key-frame, we then count the occurrences of the visual words and motion words to build a feature vector in which each element is the occurrence count of one word. We also construct a feature vector in the same way from Hessian-Affine regions and concatenate it with the feature vector constructed from the Harris-Affine regions before classification. With these feature vectors, we use a maximum entropy model (MEM) [6] to classify the presence or absence of each high-level feature. An MEM estimates the posterior distribution of a label (presence or absence) given the features of a key-frame image. We use the MALLET implementation [7] in our system.

Figure 1. High-level feature extraction system.

1.2 Visual word

We use a 128-dimensional SIFT descriptor to describe each region extracted from a key-frame image in the training dataset. We then construct a tree-structured codebook, shared among all HLFs, by recursively clustering all the SIFT descriptors in the training data. Finally, we select nodes from the tree-structured codebook and use them as visual words for each HLF. We explain the construction of the tree-structured codebook and the node selection method in detail below.

1.2.1 Tree-structured codebook

We use K-means clustering to construct the binary-tree cluster codebook, with the Euclidean distance between vectors as the distance measure. First, we place all the SIFT descriptors in a root node and divide them into two clusters, which become the child nodes of the root. We continue dividing each child cluster into two clusters recursively until the total number of clusters reaches a predetermined threshold Sa. Finally, in each node we replace the allocated descriptors with their mean. After this replacement step we obtain a binary tree structure.

1.2.2 Node selection

For each HLF, we compare the descriptors of every key-frame in which the HLF is present with all the nodes of the tree-structured codebook, and we select the nodes that satisfy the rules described below. The node selection process is as follows (a code sketch of the codebook construction and node selection is given after Figure 2):

(0) Let Di be the set of SIFT descriptors extracted from the key-frames in which HLFi is present, and let T be the tree-structured codebook.
(1) Each SIFT descriptor in Di is quantized using the leaf-node set of T, and the number of occurrences is assigned to each leaf node.
(2) Starting from the leaf nodes, the occurrence counts of the leaf nodes are added to their parent nodes.
(3) Starting from the leaf nodes, we select the nodes whose occurrence count exceeds a predetermined threshold S.

We define the set of selected nodes as the dictionary for HLFi and the number of selected nodes as the size of this dictionary. Figure 2 shows examples of node selection with different thresholds S.

Figure 2. Examples of node selection with different thresholds S for HLFi. The number inside a node is the occurrence count for HLFi; black nodes are the selected nodes.
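The following is a minimal sketch of the codebook construction and node selection just described. It assumes scikit-learn's KMeans for the two-way splits, splits the largest leaf first (one possible splitting order; the paper does not specify one), and uses illustrative names (Node, build_tree, select_nodes) that are not taken from the paper.

```python
# Sketch of the tree-structured codebook (Subsection 1.2.1)
# and node selection (Subsection 1.2.2).
import numpy as np
from sklearn.cluster import KMeans

class Node:
    def __init__(self, centroid):
        self.centroid = centroid   # mean of the descriptors assigned to this node
        self.children = []         # empty list for leaf nodes
        self.count = 0             # occurrence count used during node selection

def build_tree(descriptors, max_leaves):
    """Recursively split descriptors with 2-means until max_leaves (Sa) leaves exist."""
    root = Node(descriptors.mean(axis=0))
    leaves = [(root, descriptors)]
    while len(leaves) < max_leaves:
        leaves.sort(key=lambda item: len(item[1]), reverse=True)
        node, data = leaves.pop(0)          # split the most populated leaf next
        if len(data) < 2:
            leaves.append((node, data))
            break
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(data)
        for k in (0, 1):
            subset = data[labels == k]
            if len(subset) == 0:            # degenerate split; skip
                continue
            child = Node(subset.mean(axis=0))
            node.children.append(child)
            leaves.append((child, subset))
    return root

def leaf_nodes(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaf_nodes(child)]

def select_nodes(root, hlf_descriptors, threshold):
    """Build the dictionary for one HLF: quantize its descriptors at the leaves,
    propagate counts upward, and keep every node whose count exceeds S."""
    leaves = leaf_nodes(root)
    centroids = np.stack([leaf.centroid for leaf in leaves])
    for d in hlf_descriptors:                          # (1) quantize at the leaves
        nearest = np.argmin(np.linalg.norm(centroids - d, axis=1))
        leaves[nearest].count += 1

    def propagate(node):                               # (2) add leaf counts into parents
        if node.children:
            node.count = sum(propagate(c) for c in node.children)
        return node.count
    propagate(root)

    def collect(node):                                 # (3) keep nodes whose count exceeds S
        selected = [node] if node.count > threshold else []
        for c in node.children:
            selected.extend(collect(c))
        return selected
    return collect(root)
```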
1.3 Motion word

Motion information was very helpful for some features, such as airplane_flying, bus, and emergency_vehicle, in our previous experiments. We pay close attention to regions in which the motion is active, and the visual word of such a region is chosen as a motion word. We determine whether a visual word is moving as follows. First, the differences between two neighboring frames are calculated. Then the motion activity value of each region is obtained as the average of these differences within the region. Finally, the motion activity value is compared with a predetermined threshold Tm to decide whether the motion of the region is active. The motion activity information and the region's visual word are combined to form a "motion word".

1.4 Modeling

For each key-frame, we build a feature vector in which each element is the occurrence count of a visual word or motion word. With these feature vectors, we use a maximum entropy model (MEM) to classify the presence or absence of each high-level feature. An MEM estimates the posterior distribution of a label (presence or absence) given the features of a key-frame image:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right), \qquad (1)$$

where x is a feature vector, y is a binary variable representing the presence or absence of the HLF, Z(x) is a partition function, λi is a model parameter, and fi(x, y) is a feature function. We use the following feature function:

$$f_{i, y'}(x, y) = \begin{cases} x_i & \text{if } y = y', \\ 0 & \text{otherwise}. \end{cases}$$

We use the limited-memory BFGS method to estimate the parameter set {λi} (a small classification sketch is given at the end of this section).

1.5 Region information

Since the key-frame images in which HLFs are annotated contain many uninteresting areas, extracting features from the whole key-frame image rather than from the exact areas of interest may introduce a lot of noise. To reduce the impact of this noise, we focus on the exact location of each HLF rather than the whole key-frame image. This year we used the location information of each object-related HLF defined by MCG-ICT-CAS [4]: each such HLF is located by a rectangle whose coordinates are recorded and provided by MCG-ICT-CAS, and we use this information in our experiment. When the region extraction explained in Subsection 1.1 is performed, it is therefore reasonable to restrict the process to the inside of the rectangle. We apply this restriction when extracting the visual words of the 14 object-related HLFs (shown in Table 1). For the other six scene-related HLFs (also shown in Table 1), no location information is available, so no region restriction is applied. Figure 3 shows an example of the location information for the HLF "hand". We incorporated this location information into our system in order to examine whether it improves performance.

Figure 3. Example of the location information for the HLF "hand". The rectangular regions are the target areas.
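The region restriction of Subsection 1.5 amounts to a simple filter on the extracted keypoints, as in the sketch below. The rectangle format (x1, y1, x2, y2) is an assumption about how the MCG-ICT-CAS coordinates are stored, and the function name is illustrative.

```python
# Sketch: for an object-related HLF, keep only the descriptors whose keypoints
# fall inside one of the annotated rectangles of the key-frame.
def descriptors_in_regions(keypoints, descriptors, rectangles):
    """Return the descriptors located inside any annotated rectangle (x1, y1, x2, y2)."""
    kept = []
    for kp, desc in zip(keypoints, descriptors):
        x, y = kp.pt                       # OpenCV keypoint image coordinates
        for (x1, y1, x2, y2) in rectangles:
            if x1 <= x <= x2 and y1 <= y <= y2:
                kept.append(desc)
                break
    return kept
```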

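To make the classification step of Subsection 1.4 concrete, the sketch below trains one presence/absence classifier per HLF from the word-count feature vectors. The actual system uses MALLET's maximum entropy classifier trained with limited-memory BFGS; scikit-learn's LogisticRegression with the lbfgs solver is used here only as a functionally similar stand-in, and the function names are illustrative.

```python
# Sketch: train a presence/absence classifier for one HLF and score key-frames.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_hlf_classifier(word_count_vectors, labels):
    """word_count_vectors: (num_keyframes, num_visual_and_motion_words) occurrence counts.
    labels: 1 if the HLF is present in the key-frame, 0 otherwise."""
    model = LogisticRegression(solver="lbfgs", max_iter=1000)
    model.fit(np.asarray(word_count_vectors), np.asarray(labels))
    return model

def score_keyframes(model, word_count_vectors):
    # Posterior probability of presence, used to rank shots for the HLF.
    return model.predict_proba(np.asarray(word_count_vectors))[:, 1]
```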